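Several data augmentation methods deploy unlabeled-in-distribution (UID) data to bridge the gap between the training and inference of neural networks. However, these methods have clear limitations: they depend on the availability of UID data and on pseudo-labels. Here, we propose a data augmentation method that improves generalization in both adversarial and standard learning by using out-of-distribution (OOD) data, which are free of these issues. We show theoretically how OOD data can improve generalization in each learning scenario, and we complement the theoretical analysis with experiments on CIFAR-10, CIFAR-100, and a subset of ImageNet. The results indicate that undesirable features are shared even among image data that appear to have little correlation from a human point of view. We also present the advantages of the proposed method through comparison with other data augmentation methods that can be used in the absence of UID data. Furthermore, we demonstrate that the proposed method can further improve existing state-of-the-art adversarial training.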
translated by 谷歌翻译
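Domain generalization (DG) aims to learn, from data drawn from multiple related source domains, models whose performance remains high on unseen domains encountered at test time. Many existing DG algorithms reduce the divergence between source distributions in a representation space, which can potentially align an unseen domain close to the sources. This is motivated by analysis that explains generalization to unseen domains in terms of their distributional distance (such as the Wasserstein distance) to the sources. However, because the DG objective is open-ended, it is challenging to evaluate DG algorithms comprehensively using a few benchmark datasets. In particular, we demonstrate that the accuracy of models trained with DG methods varies significantly across unseen domains generated from popular benchmark datasets. This highlights that the performance of DG methods on a few benchmark datasets may not be representative of their performance on unseen domains in the wild. To overcome this obstacle, we propose a universal certification framework based on distributionally robust optimization (DRO) that can efficiently certify the worst-case performance of any DG method. This enables a data-independent evaluation of a DG method that is complementary to empirical evaluation on benchmark datasets. Furthermore, we propose a training algorithm that can be used with any DG method to improve its certified performance. Our empirical evaluation demonstrates the effectiveness of our method at significantly improving the worst-case loss (i.e., reducing the risk of model failure in the wild) without incurring a significant performance drop on benchmark datasets.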
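The distributional distance mentioned above can be made concrete with a small sketch. It computes the empirical 1-Wasserstein distance between two equal-size one-dimensional samples, which reduces to comparing sorted values. The function name `wasserstein_1d` and the toy feature values are assumptions for illustration only; this is not the certification framework of the abstract, which is not specified here.

```python
def wasserstein_1d(samples_a, samples_b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    In one dimension with equal sample sizes, the optimal coupling pairs the
    sorted values, so the distance is the mean absolute difference between
    order statistics.
    """
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("samples must be non-empty and of equal size")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical 1-D feature values from a source domain and a shifted unseen domain.
source_feature = [0.0, 1.0, 2.0]
shifted_feature = [1.0, 2.0, 3.0]
distance = wasserstein_1d(source_feature, shifted_feature)
```

A uniform shift of every sample by 1 moves each unit of mass by exactly 1, so the distance here equals the size of the shift.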
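Neural network-based approaches to solving partial differential equations (PDEs) have attracted considerable attention because of their simplicity and their flexibility in representing PDE solutions. When a neural network is trained, it tends to learn global features corresponding to low-frequency components first, while high-frequency components are approximated at a much slower rate (the F-principle). For a class of equations whose solutions contain a wide range of scales, the training process can suffer from slow convergence and low accuracy because the network fails to capture the high-frequency components. In this work, we propose a hierarchical approach to improve the convergence rate and accuracy of the neural network solution. The proposed method comprises multiple training levels, in which a newly introduced neural network is guided to learn the residual of the previous level's approximation. By the nature of the neural network training process, the higher-level corrections tend to capture the high-frequency components. We validate the efficiency and robustness of the proposed hierarchical approach on a suite of linear and nonlinear partial differential equations.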
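The multi-level residual idea can be sketched without any neural network. In the example below, a small sine-series least-squares fit stands in for each level's network (an assumption made purely for brevity): level 1 only represents low-frequency modes, and level 2 is fitted to the residual that level 1 leaves behind. The target function, mode ranges, and all names are hypothetical.

```python
import math


def fit_sine_modes(xs, ys, modes):
    """Fit ys by sum_k c_k * sin(k*pi*x) using one projection per mode.

    On the uniform interior grid used below, these sine modes are discretely
    orthogonal, so independent projections give the least-squares coefficients.
    """
    coeffs = {}
    for k in modes:
        basis = [math.sin(k * math.pi * x) for x in xs]
        coeffs[k] = sum(y * b for y, b in zip(ys, basis)) / sum(b * b for b in basis)
    return coeffs


def evaluate(coeffs, xs):
    """Evaluate the fitted sine series at every grid point."""
    return [sum(c * math.sin(k * math.pi * x) for k, c in coeffs.items()) for x in xs]


n = 64
xs = [j / n for j in range(1, n)]  # interior points of [0, 1]
# Toy multi-scale target: one low-frequency and one high-frequency component.
target = [math.sin(math.pi * x) + 0.2 * math.sin(8 * math.pi * x) for x in xs]

# Level 1: a low-capacity model that can only represent modes 1 and 2.
level1 = fit_sine_modes(xs, target, modes=range(1, 3))
approx1 = evaluate(level1, xs)

# Level 2: a new model is trained on the residual of the level-1 approximation.
residual = [t - a for t, a in zip(target, approx1)]
level2 = fit_sine_modes(xs, residual, modes=range(3, 13))
approx2 = [a + c for a, c in zip(approx1, evaluate(level2, xs))]

err1 = max(abs(t - a) for t, a in zip(target, approx1))
err2 = max(abs(t - a) for t, a in zip(target, approx2))
```

The level-1 error is dominated by the high-frequency term, and the level-2 correction removes it. In the setting of the abstract each level would be a trained network rather than a closed-form fit.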
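Certified robustness guarantees gauge a model's robustness to test-time attacks and can be used to assess its readiness for real-world deployment. In this work, we critically examine how the adversarial robustness guarantees of certification methods based on randomized smoothing change when state-of-the-art certifiably robust models encounter out-of-distribution (OOD) data. Our analysis reveals a previously unknown vulnerability of these models to low-frequency OOD data, such as weather-related corruptions, which renders them unfit for deployment in the wild. To alleviate this issue, we propose a novel data augmentation scheme, FourierMix, that produces augmentations to improve the spectral coverage of the training data. Furthermore, we propose a new regularizer that encourages consistent predictions on noise perturbations of the augmented data, improving the quality of the smoothed models. We find that FourierMix augmentations help eliminate the spectral bias of certifiably robust models, enabling them to achieve significantly better robustness guarantees on a range of OOD benchmarks. Our evaluation also reveals that current OOD benchmarks are unable to highlight the spectral biases of models. To this end, we propose a comprehensive benchmarking suite that contains corruptions from different regions of the spectral domain. Evaluating models trained with popular augmentation methods on the proposed suite highlights their spectral biases and establishes the superiority of FourierMix-trained models at achieving better certified robustness guarantees under OOD shifts across the entire frequency spectrum.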
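For readers unfamiliar with randomized smoothing, the prediction step can be sketched as a Monte Carlo majority vote of a base classifier over Gaussian-perturbed copies of the input. The names `smoothed_predict` and `toy_classifier`, the noise level, and the sample count are assumptions for illustration. A real certificate additionally requires a statistical lower confidence bound on the top-class probability, which this sketch omits.

```python
import random
from collections import Counter


def smoothed_predict(base_classifier, x, sigma, n_samples, rng):
    """Majority vote of base_classifier over Gaussian perturbations of x.

    Returns the most frequent label and its empirical vote fraction.
    """
    counts = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        counts[base_classifier(noisy)] += 1
    label, votes = counts.most_common(1)[0]
    return label, votes / n_samples


def toy_classifier(x):
    """Hypothetical base classifier: class 1 if the first coordinate is positive."""
    return 1 if x[0] > 0 else 0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
label, vote_fraction = smoothed_predict(
    toy_classifier, [1.0, 0.0], sigma=0.25, n_samples=1000, rng=rng
)
```

Because the input lies four standard deviations from the toy decision boundary, nearly all noisy copies receive class 1. Inputs closer to the boundary would yield a smaller vote fraction and hence a weaker guarantee.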
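Unsupervised domain adaptation (UDA) enables cross-domain learning by transferring knowledge from a labeled source domain to an unlabeled target domain with a different distribution. However, UDA is not always successful, and several accounts of "negative transfer" have been reported in the literature. In this work, we prove a simple lower bound on the target domain error that complements the existing upper bound. Our bound shows that minimizing the source domain error and the marginal distribution mismatch is insufficient to guarantee a reduction in the target domain error, because the mismatch between the induced labeling functions may increase. This insufficiency is further illustrated with simple distributions on which the same UDA approach succeeds, fails, or succeeds or fails with equal chance. Motivated by this, we propose novel data poisoning attacks that fool UDA methods into learning representations that produce large target domain errors. We evaluate the effect of these attacks on popular UDA methods using benchmark datasets on which they have previously been shown to succeed. Our results show that poisoning can significantly decrease target domain accuracy, dropping it to almost 0% in some cases, with the addition of 10% poisoned data in the source domain. The failure of these UDA methods demonstrates their limitations in guaranteeing cross-domain generalization, consistent with our lower bound. Thus, evaluating UDA methods in adversarial settings such as data poisoning gives a better sense of their robustness to data distributions that are unfavorable for UDA.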
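The style of a font or typeface is often associated with specific impressions, such as heavy, contemporary, or elegant. This suggests that there are certain correlations between font shapes and their impressions. To understand these correlations, this paper realizes a shared latent space in which a font and its impressions are embedded close to each other. The difficulty is that the impression words attached to a font are often very noisy, because impression words are highly subjective and diverse. More importantly, some impression words have no direct relevance to font shape and disturb the realization of the shared latent space. We therefore use DeepSets to enhance shape-relevant words and automatically suppress shape-irrelevant words while training the shared latent space. Quantitative and qualitative experimental results on a large font-impression dataset show that the shared latent space produced by the proposed method describes the correlation appropriately, especially for shape-relevant impression words.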
Weakly-supervised object localization aims to indicate both the category and the extent of an object in an image given only image-level labels. Most existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map so that it covers the whole object, yet they ignore the co-occurrence confounder of object and context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. In addition, the use of CAM brings a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object features, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and in addressing the dilemma between classification and localization performance.
Knowledge graph embedding (KGE), which maps entities and relations in a knowledge graph into continuous vector spaces, has achieved great success in predicting missing links in knowledge graphs. However, knowledge graphs often contain incomplete triples that are difficult for KGEs to infer inductively. To address this challenge, we resort to analogical inference and propose AnKGE, a novel and general self-supervised framework that enhances KGE models with analogical inference capability. We propose an analogical object retriever that retrieves appropriate analogical objects at the entity, relation, and triple levels. In AnKGE, we train an analogy function for each level of analogical inference; it takes the original element embedding from a well-trained KGE model as input and outputs the analogical object embedding. To combine the inductive inference capability of the original KGE model with the analogical inference capability added by AnKGE, we interpolate the analogy score with the base model score and introduce adaptive weights in the score function used for prediction. Through extensive experiments on the FB15k-237 and WN18RR datasets, we show that AnKGE achieves competitive results on the link prediction task and performs analogical inference well.
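The score interpolation described above can be illustrated with a minimal sketch. It assumes a TransE-style base score and a plain weighted sum of analogy scores with fixed weights. The abstract does not give the exact form of the analogy functions or of the adaptive weights, so `transe_score`, `combined_score`, the embeddings, and the weight value are all hypothetical.

```python
import math


def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def combined_score(base_score, analogy_scores, weights):
    """Add weighted analogy scores to the base model score."""
    return base_score + sum(w * s for w, s in zip(weights, analogy_scores))


# Toy 2-D embeddings (hypothetical values).
head, relation, tail = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
analog_head = [1.0, 0.5]  # stand-in for the output of an entity-level analogy function

base = transe_score(head, relation, tail)
entity_analogy = transe_score(analog_head, relation, tail)
final_score = combined_score(base, [entity_analogy], weights=[0.5])
```

In this toy case the base triple fits exactly, and the analogy term lowers the final score slightly because the analogical head is a less exact match.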
When robots learn reward functions using high-capacity models that take raw state directly as input, they need to learn both a representation of what matters in the task (the task "features") and how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data and fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users which behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if the low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate what needs to go into the representation from what is spurious, as well as which aspects of behavior can be compressed together and which cannot. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. By contrast, to learn the representations that people use, so that we can learn their preferences and objectives, we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on a wide range of visual tasks. Transferring knowledge from such powerful pre-trained VLMs is emerging as a promising direction for building effective video recognition models. However, current exploration is still limited. In our opinion, the greatest charm of pre-trained vision-language models is that they build a bridge between the visual and textual domains. In this paper, we present a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) we propose a Video Attribute Association mechanism, which leverages Video-to-Text knowledge to generate textual auxiliary attributes that complement video recognition; ii) we also present a Temporal Concept Spotting mechanism, which uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner and yield an enhanced video representation. Extensive studies on popular video datasets (i.e., Kinetics-400 & 600, UCF-101, HMDB-51, and ActivityNet) show that our method achieves state-of-the-art performance in most recognition scenarios, e.g., general, zero-shot, and few-shot video recognition. To the best of our knowledge, our best model achieves a state-of-the-art accuracy of 88.4% on the challenging Kinetics-400 with the released CLIP pre-trained model.
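One way to picture parameter-free temporal saliency is to weight each frame by how well it matches the text embedding and then pool the frames with those weights. The sketch below assumes cosine similarity followed by a softmax with a fixed temperature. It is only an illustration of the general idea, not the Temporal Concept Spotting mechanism itself, whose details the abstract does not give. All names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_saliency_pool(frame_embeddings, text_embedding, temperature=0.1):
    """Weight frames by a softmax over frame-text similarity, then average them.

    No learned parameters are involved; the temperature is a fixed hyperparameter.
    """
    sims = [cosine(f, text_embedding) for f in frame_embeddings]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(text_embedding)
    pooled = [sum(w * f[d] for w, f in zip(weights, frame_embeddings)) for d in range(dim)]
    return weights, pooled


# Three toy frame embeddings and one text embedding (hypothetical values).
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
text = [0.0, 1.0]
weights, video_embedding = temporal_saliency_pool(frames, text)
```

Frames that align with the text dominate the pooled video embedding, while unrelated frames contribute almost nothing.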